A protein network refinement method based on module discovery and biological information

Background The identification of essential proteins can help in understanding the minimum requirements for cell survival and development, in discovering drug targets and in preventing disease. Node ranking methods are now a common way to identify essential proteins, but the poor data quality of the underlying protein interaction network (PIN) limits their identification accuracy. Researchers have therefore constructed refined networks by considering certain biological properties of interacting protein pairs, in order to improve the performance of node ranking methods on the PIN. Studies show that proteins in a complex are more likely to be essential than proteins not present in any complex. However, existing PIN refinement methods usually ignore modularity. Methods Based on this, we proposed a network refinement method based on module discovery and biological information. The idea is, first, to extract the maximal connected subgraph of the PIN and divide it into modules using the Fast-unfolding algorithm; then, to detect critical modules according to the orthologous information, subcellular localization information and topological information within each module; finally, to construct a more refined network (CM-PIN) from the identified critical modules. Results To evaluate the effectiveness of the proposed method, we used 12 typical node ranking methods (LAC, DC, DMNC, NC, TP, LID, CC, BC, PR, LR, PeC, WDC) and compared their overall performance on the CM-PIN with that on the S-PIN, D-PIN and RD-PIN. The experimental results showed that the CM-PIN was optimal in terms of the number of essential proteins identified, the precision-recall curve, the Jackknifing method and other criteria, and can help to identify essential proteins more accurately.

to survive [1]. In addition, essential proteins are associated with human disease-causing genes, and their identification and analysis can help in the design of drug targets.
Early studies of essential proteins were mainly conducted with wet-lab experimental methods such as RNA interference [2], single gene knockout [3] and conditional gene knockout [4], which are often expensive and time-consuming. The identification of essential proteins by computational methods has therefore become the current trend.
However, centrality methods use only the topological features of protein interaction networks to assess the importance of proteins, so it is difficult for them to achieve the desired predictive performance. In recent years, researchers have tended to integrate multiple kinds of biological information about proteins to identify essential proteins more accurately. For example, Li et al. [16] and Tang et al. [17] proposed the PeC and WDC methods, respectively, by integrating the degree of co-expression between protein pairs in gene expression profiles with the edge clustering coefficients of their interactions. Qin et al. [18] proposed the LBCC method, which is based on network topological features and protein complexes. Li et al. [19] pointed out that proteins in complexes are more likely to be essential than proteins not present in any complex, and proposed the UC method by combining protein complexes and topological features of PINs. Lei et al. [20] proposed the PCSD method, which fuses the degree of protein complex involvement and subgraph density. Zhong et al. [21] used a dynamic threshold method to binarize gene expression values and proposed the JDC method, which combines the co-expression states and edge clustering coefficients of protein pairs at multiple time points.
Although these node ranking methods have made great progress in identifying essential proteins, most of them rely on the topological information of proteins in the PIN. Network-based centrality methods in particular are highly dependent on the accuracy of the underlying PIN. However, most PINs obtained from high-throughput experiments have been found to contain false positives or false negatives [22], which may reduce the identification accuracy of most node ranking methods.
To improve the identification accuracy of essential proteins, some researchers have used biological information about proteins to filter out unreliable interactions in the PIN, thereby constructing a refined PIN on which node ranking methods are applied. For example, starting from the static PIN (S-PIN), Xiao et al. [23] removed unreliable interactions by determining from gene expression data whether the two proteins of a pair were active at the same time, and constructed a once-refined PIN (D-PIN). Subsequently, Li et al. [24] further removed unreliable interactions from the D-PIN by determining whether the two proteins of a pair appeared in the same subcellular compartment, and constructed a twice-refined PIN (RD-PIN).
Nevertheless, some researchers have pointed out that PINs have modular characteristics [25][26][27]: the essentiality of a protein is related not only to the protein itself but also to the functional module in which it is located, and proteins within the same module are more similar to each other than to proteins in other modules. Furthermore, Zotenko et al. [28] found that, in PINs, a large number of essential proteins may be present in highly dense functional modules. The aforementioned refinement studies focused only on the edges between protein nodes and ignored the modularity of PINs. Therefore, how to better utilize the modularity of PINs to construct an efficient PIN and improve the performance of node ranking methods is still a question worth exploring.
For the identification of community structure in complex networks, researchers have proposed a series of module discovery algorithms. For example, algorithms based on modularity [29,30] and on an information-theoretic framework [31] divide complex networks into non-overlapping modules, while the modules discovered by clique-percolation based [32] and edge-clustering based [33] methods can overlap. In recent studies, some researchers have used both network structure and node attributes to cluster complex networks more accurately [34][35][36]. For example, Hu et al. [35] and Yang et al. [36] developed two fuzzy-based graph clustering algorithms that take into account the key dependencies between node embedding and the resulting clustering. In our study, the modularity-based Fast-unfolding algorithm was used to partition PINs into modules and to analyze the differences between modules.
We found that the biological and topological information contained in different modules of a PIN varies greatly. For example, some modules are dense but contain few essential proteins, which may be counterproductive for identifying essential proteins in the PIN. Therefore, the identification and selection of critical modules is of great significance for the construction of higher-quality PINs. In other words, if the network is refined properly in combination with the modularity of the PIN, the performance of node ranking methods on the PIN may be improved more effectively.
Based on this, we proposed a network refinement method based on module discovery and biological information to improve the identification accuracy of essential proteins for node ranking methods. For a given PIN, the idea is, first, to remove the interactions in the small connected subgraphs of the PIN; second, to divide the maximal connected subgraph into several closely connected modules with the modularity-based Fast-unfolding algorithm; third, to select the critical modules by combining the orthologous information and subcellular localization information of proteins with the topological features of each module; finally, to construct a more refined PIN (CM-PIN) from the selected critical modules.
To evaluate the effectiveness of the proposed network refinement method, two species, Saccharomyces cerevisiae and Homo sapiens, were used for validation. We applied 12 node ranking methods (LAC, DC, DMNC, NC, TP, LID, CC, BC, PR, LR, PeC, WDC) to the S-PIN, D-PIN and RD-PIN, and compared the results with those on the CM-PINs derived from each of these networks. The experimental results showed that the performance of the 12 node ranking methods was best on the CM-PIN in terms of the number of essential proteins identified at top 100-600, the Jackknifing method, the area under the precision-recall curve, sensitivity, specificity, positive predictive value, negative predictive value, F-measure, Matthews correlation coefficient and accuracy. These results show that the proposed network refinement method yields a more efficient PIN, which helps to improve the identification accuracy of essential proteins for node ranking methods and is superior to the existing refined networks (D-PIN and RD-PIN).

Methods
In this section, we first describe how to build the three protein interaction networks S-PIN, D-PIN and RD-PIN. We then describe how to screen the critical modules using the biological information of proteins and the topological features of each module, and how to construct CM-PINs on the S-PIN, D-PIN and RD-PIN, respectively. The overall steps of this approach are shown in Fig. 1.
Fig. 1 The overall steps of the construction of the CM-PIN. First, in the block for the construction of the D-PIN and RD-PIN, we combined the static PIN (S-PIN) with gene expression profiles to construct the D-PIN, and then further combined subcellular localization information to construct the RD-PIN. In this paper, the corresponding CM-PINs are constructed from these networks. Second, in the block for the construction of the CM-PIN, Step 1 extracts the maximal connected subgraph of a given PIN; Step 2 divides the maximal connected subgraph into several modules using the Fast-unfolding algorithm; Step 3 identifies critical modules using the biological information (orthologous information and subcellular localization information) and topological information of proteins; Step 4 refines the given PIN and constructs the CM-PIN according to the identified critical modules

S-PIN, D-PIN and RD-PIN
A static protein-protein interaction network (S-PIN) [37][38][39] is an undirected graph $G_S = (V_S, E_S)$, where $V_S$ represents the set of proteins and $E_S$ represents the set of protein interactions.
A dynamic protein-protein interaction network (D-PIN) [23] is an edge-induced subgraph $G_D = (V_D, E_D)$ of the S-PIN defined in terms of the gene expression levels of proteins, where $V_D = V_S$ and $E_D \subseteq E_S$. Let $e_{ik}$ denote the gene expression level of $v_i$ at time point $t_k$. If $e_{ik}$ is greater than $\tau_i$, then $v_i$ is active at time point $t_k$. For any $(v_i, v_j) \in E_S$, if both $v_i$ and $v_j$ are active at the same time point $t_k$, the interaction between them is preserved in $E_D$; otherwise it is removed from $E_D$. The activity threshold $\tau_i$ of protein $v_i$ is calculated by Eq. (1) [25], where $\mu_i$ denotes the mean of the $n$ time-point gene expression values of $v_i$ and $\sigma_i$ is the standard deviation of these values. In this paper, $n = 36$ for Saccharomyces cerevisiae and $n = 64$ for Homo sapiens.
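The edge-filtering rule above can be illustrated with a minimal Python sketch. It assumes the activity thresholds $\tau_i$ have already been computed and are passed in as a plain dictionary; the function names, the toy data and the choice to drop edges without expression data are illustrative assumptions, not part of the original method description.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def active_time_points(expr: List[float], tau: float) -> Set[int]:
    """Indices k of the time points at which expression exceeds the threshold."""
    return {k for k, value in enumerate(expr) if value > tau}


def build_d_pin(edges: List[Edge],
                expression: Dict[str, List[float]],
                thresholds: Dict[str, float]) -> List[Edge]:
    """Keep an edge only if both proteins are active at a common time point."""
    kept = []
    for u, v in edges:
        if u not in expression or v not in expression:
            continue  # assumption: edges lacking expression data are dropped
        shared = (active_time_points(expression[u], thresholds[u])
                  & active_time_points(expression[v], thresholds[v]))
        if shared:
            kept.append((u, v))
    return kept


# Toy data (hypothetical): three proteins, three time points.
expression = {"A": [1.0, 5.0, 1.0], "B": [0.5, 4.0, 0.2], "C": [6.0, 0.1, 0.3]}
thresholds = {"A": 3.0, "B": 2.0, "C": 3.0}
s_pin = [("A", "B"), ("A", "C"), ("B", "C")]
d_pin = build_d_pin(s_pin, expression, thresholds)
```

In this toy example A and B are both active at the second time point, whereas C is active only at the first, so only the A-B interaction survives.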
A refined dynamic protein-protein interaction network (RD-PIN) [24] is an edge-induced subgraph $G_{RD} = (V_{RD}, E_{RD})$ of the D-PIN defined in terms of the subcellular localization information of proteins, where $V_{RD} = V_D$ and $E_{RD} \subseteq E_D$. Let $l_1(v_i), l_2(v_i), \ldots, l_r(v_i)$ be the 11 subcellular localization statuses of protein $v_i$, where $r = 11$. If $v_i$ is in the $m$th subcellular compartment, then $l_m(v_i) = 1$; otherwise $l_m(v_i) = 0$. For any $(v_i, v_j) \in E_D$, the interaction between $v_i$ and $v_j$ is preserved in $E_{RD}$ only when $l_m(v_i) = l_m(v_j) = 1$ for some compartment $m$; otherwise it is removed from $E_{RD}$.
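As a small illustration of this second filter, the sketch below keeps an interaction only when the two proteins share at least one subcellular compartment. Representing the localization statuses as a set of compartment names per protein is an assumption made for brevity; the names and toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def build_rd_pin(edges: List[Edge],
                 compartments: Dict[str, Set[str]]) -> List[Edge]:
    """Keep edges whose endpoints co-occur in at least one compartment."""
    return [
        (u, v) for u, v in edges
        # proteins without annotation get an empty set and never match
        if compartments.get(u, set()) & compartments.get(v, set())
    ]


# Toy data (hypothetical).
compartments = {
    "A": {"nucleus", "cytosol"},
    "B": {"nucleus"},
    "C": {"mitochondrion"},
}
d_pin_edges = [("A", "B"), ("B", "C")]
rd_pin_edges = build_rd_pin(d_pin_edges, compartments)
```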

Construction of the CM-PIN
The construction of the CM-PIN consists of four steps (consistent with Fig. 1): Step 1: retaining interactions in the maximal connected subgraph, that is, removing the interactions in the remaining small connected subgraphs of the given PIN; Step 2: module discovery based on the Fast-unfolding algorithm, that is, dividing the obtained maximal connected subgraph into several modules using the Fast-unfolding algorithm; Step 3: detecting critical modules, that is, screening out critical modules using the biological and topological information of the modules; Step 4: refining the protein-protein interaction network, that is, removing the interactions of non-critical modules from the original PIN and constructing the CM-PIN.
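Step 1 can be sketched in a few lines of plain Python. The sketch assumes the PIN is given as a list of undirected edges and that "maximal connected subgraph" means the connected component with the most proteins; the function name is illustrative.

```python
from collections import defaultdict, deque
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def largest_component_edges(edges: List[Edge]) -> List[Edge]:
    """Return the edges that lie in the largest connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen: Set[str] = set()
    best: Set[str] = set()
    for start in adj:
        if start in seen:
            continue
        # breadth-first search collects one connected component
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component

    # both endpoints of an edge are always in the same component
    return [(u, v) for u, v in edges if u in best]


toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
kept_edges = largest_component_edges(toy_edges)
```

Here the isolated D-E pair is discarded and the A-B-C chain is kept.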
The construction process of the CM-PIN is described in the following algorithm.
Pan et al. BMC Bioinformatics (2024) 25:157

Algorithm: Construction of the CM-PIN
Step 1: retaining interactions in maximal connected subgraphs
It has been found that PINs have scale-free properties [40,41], that is, the degrees of the nodes in a PIN obey a power-law distribution. A PIN is usually a disconnected graph consisting of several connected subgraphs: most of the proteins and their interactions are present in the maximal connected subgraph, while the remaining connected subgraphs contain very few proteins and interactions. As shown in Table 1, we counted the proportion of interactions in the maximal connected subgraphs of the YDIP, YBioGRID and HDIP datasets relative to the interactions of the original networks.

Step 2: module discovery based on Fast-unfolding algorithm
It has been shown that PINs have modular properties [25,26], and modularity reflects the presence of highly connected protein clusters in PINs. Clustering of protein interaction networks is therefore an effective method for module delineation. In this paper, the Fast-unfolding module discovery algorithm, a hierarchical clustering method, is used to divide the PIN into modules. The purpose of module partitioning is to make the connections within modules tighter and the connections between modules sparser. To evaluate the quality of a module division, Newman et al. [29] proposed the concept of modularity. Let $e_{ii}$ be the ratio of the number of edges within module $i$ to the total number of edges in the network, and let $a_i$ be the fraction of edge ends attached to nodes within module $i$ (the total degree of the nodes in module $i$ divided by twice the total number of edges). The modularity $Q$ can then be expressed as:

$$Q = \sum_{i} \left( e_{ii} - a_i^2 \right)$$

A larger modularity indicates tighter connections within modules and, conversely, a smaller modularity indicates sparser connections within modules. When the modularity $Q$ reaches its maximum value, the division into modules is optimal.
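To make the definition of modularity concrete, the following sketch computes $Q$ for an unweighted, undirected edge list and a given assignment of nodes to modules. It uses the standard Newman form, in which $a_i$ is the total degree of module $i$ divided by twice the number of edges; the function name and the toy graph are illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: List[Edge], module_of: Dict[Hashable, Hashable]) -> float:
    """Q = sum_i (e_ii - a_i^2) for an unweighted, undirected graph."""
    m = len(edges)
    inside = defaultdict(int)      # edges with both ends in the module
    degree_sum = defaultdict(int)  # total degree of the nodes in the module
    for u, v in edges:
        degree_sum[module_of[u]] += 1
        degree_sum[module_of[v]] += 1
        if module_of[u] == module_of[v]:
            inside[module_of[u]] += 1
    return sum(
        inside[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "all" for node in range(6)}
q_two = modularity(toy_edges, two_modules)
q_one = modularity(toy_edges, one_module)
```

For the toy graph, splitting along the bridge gives $Q = 2(3/7 - 1/4) \approx 0.357$, whereas putting every node in one module gives $Q = 0$.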
Blondel et al. [30] proposed the Fast-unfolding algorithm for discovering module structures in large networks, which is a heuristic algorithm based on modularity optimization. Compared with traditional module discovery algorithms, Fast-unfolding has lower time complexity on large-scale networks and gives stable module partitions, which is why it was chosen to partition modules in this paper. The Fast-unfolding algorithm proceeds as follows. First, initialize by assigning each protein node to its own module. Second, for each protein node, try moving it into the module of each of its neighboring nodes, calculate the modularity $Q$ after the move, and check whether the difference $\Delta Q$ between the modularity after and before the move is positive; if it is positive, accept the move, otherwise reject it. Third, repeat this process until the modularity $Q$ can no longer be increased. The division into modules is then complete, yielding the set of modules $\{c_1, c_2, \ldots, c_m\}$, where $m$ is the number of modules. It is worth noting that the resulting modules are non-overlapping.
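The node-moving procedure described above can be sketched as follows. This is a deliberately simplified illustration, not the full algorithm of Blondel et al.: it implements only the greedy node-moving phase, recomputes $Q$ from scratch for every trial move (acceptable only for small graphs), moves a node to the neighbouring module with the largest positive gain, and omits the module-aggregation passes. All names are illustrative.

```python
from collections import defaultdict


def modularity(edges, module_of):
    """Q = sum_i (e_ii - a_i^2) for an unweighted, undirected graph."""
    m = len(edges)
    inside, degree_sum = defaultdict(int), defaultdict(int)
    for u, v in edges:
        degree_sum[module_of[u]] += 1
        degree_sum[module_of[v]] += 1
        if module_of[u] == module_of[v]:
            inside[module_of[u]] += 1
    return sum(inside[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


def local_moving(edges):
    """Greedy node moving: accept a move only if it increases Q."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = sorted(adj)
    module_of = {v: v for v in nodes}  # each node starts in its own module

    improved = True
    while improved:
        improved = False
        for v in nodes:
            home = module_of[v]
            best_q, best_module = modularity(edges, module_of), home
            # candidate modules are those of the neighbours of v
            for cand in sorted({module_of[u] for u in adj[v]} - {home}):
                module_of[v] = cand
                q = modularity(edges, module_of)
                if q > best_q + 1e-12:  # tolerance guards against float ties
                    best_q, best_module = q, cand
            module_of[v] = best_module  # restores home if no move helped
            if best_module != home:
                improved = True
    return module_of


# Two triangles joined by a bridge edge.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
partition = local_moving(toy_edges)
```

On the toy graph the procedure recovers the two triangles as separate modules. Because every accepted move strictly increases $Q$ and the number of partitions is finite, the loop terminates.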

Step 3: detecting critical modules
To determine the importance of each module, we used three features (i.e., orthologous information, subcellular localization information, and topological information of the module) to score each module in the PIN.
(1) Determine the importance of modules using orthologous information of proteins.
Studies have shown that essential proteins evolve much more slowly than nonessential proteins [42], i.e., essential proteins are more conserved. We believe that modules containing more conserved proteins are more likely to be critical, and the conservation of proteins is mainly reflected in their orthologous information. Therefore, we calculate the Pearson correlation coefficient between each module and the orthologous information of the proteins in the PIN as the first score of the module. For protein $v_i$, let $O(v_i)$ represent the set of reference organisms in which at least one orthologous protein pair including $v_i$ occurs; $|O(v_i)|$ is the orthologous score of $v_i$, and the vector consisting of the orthologous scores of all proteins in the PIN is denoted by $y$. A module $c_i$ is represented by a vector $x_i$ that contains only 0 and 1 (1 if the protein is in module $c_i$, 0 otherwise). The Pearson correlation coefficient $PC(c_i)$ between module $c_i$ and the orthologous scores is:

$$PC(c_i) = \frac{\sum_{k=1}^{n} (x_{ik} - \mu_{x_i})(y_k - \mu_y)}{\sqrt{\sum_{k=1}^{n} (x_{ik} - \mu_{x_i})^2}\,\sqrt{\sum_{k=1}^{n} (y_k - \mu_y)^2}}$$

where $n$ is the number of proteins in the PIN, $x_{ik}$ and $y_k$ are the $k$th components of $x_i$ and $y$, and $\mu_{x_i}$ and $\mu_y$ are the mean values of $x_i$ and $y$. Thus, the set of possible critical modules selected based on the orthologous information of the proteins within the module is denoted as $C_{orth} = \{c_i \mid PC(c_i) \ge th_1\}$, where $th_1$ is a threshold value.
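The first module score can be illustrated with a short sketch that builds the 0/1 membership vector of a module and correlates it with the orthologous scores using the standard Pearson formula. The function name, the toy scores and the choice to return 0 when either vector is constant are assumptions made for illustration.

```python
import math
from typing import Dict, Set


def module_orthology_correlation(members: Set[str],
                                 orth_score: Dict[str, float]) -> float:
    """Pearson correlation between module membership and orthologous scores."""
    proteins = sorted(orth_score)  # fixed protein order for both vectors
    x = [1.0 if p in members else 0.0 for p in proteins]
    y = [float(orth_score[p]) for p in proteins]
    n = len(proteins)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
                   * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    # assumption: an undefined correlation (constant vector) is scored as 0
    return numerator / denominator if denominator else 0.0


# Toy orthologous scores |O(v)| (hypothetical values).
orth_score = {"A": 4, "B": 3, "C": 1, "D": 0}
pc_high = module_orthology_correlation({"A", "B"}, orth_score)
pc_low = module_orthology_correlation({"C", "D"}, orth_score)
```

A module made of the two most conserved proteins gets a positive score ($3/\sqrt{10} \approx 0.949$ here), and its complement gets the negated value.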
(2) Determine the importance of modules using subcellular localization information of proteins.
The importance of a protein is related not only to its orthologous information but also to its subcellular localization information, which allows the critical modules in the PIN to be identified from another perspective. We counted the number of times proteins and essential proteins were present in each subcellular compartment, and found that proteins, including essential proteins, were most widely distributed in the nucleus. Therefore, we considered that the more often the proteins within a module were present in the nucleus, the more likely that module was to be critical. For module $c_i$, the second score is based on the number of times the proteins in module $c_i$ occur in the nucleus; it is denoted by $NSL(c_i)$ and defined in Eq. (4), where $N(c_i)$ is the number of times the proteins within the module appear in the nucleus and $n(c_i)$ is the number of nodes within the module. The set of possible critical modules selected based on the subcellular localization information of the proteins within the module is represented by $C_{sub} = \{c_i \mid NSL(c_i) \ge th_2\}$, where $th_2$ is a threshold value.
(3) Determine the importance of modules using topological characteristics of modules.
To identify the importance of a module, we also used the topological characteristics of each module in the network. It has been pointed out that a large number of essential proteins may exist in highly dense functional modules [28]. Thus, we considered that the richer the interactions within a module, the more likely it is to play an important role in the whole network, and we calculated the topological characteristics of module $c_i$ as its third score, denoted by $TF(c_i)$ and defined in Eq. (5), where $I(c_i)$ is the number of interactions inside module $c_i$, $O(c_i)$ is the number of interactions between module $c_i$ and other modules, and $n(c_i)$ is the number of nodes of module $c_i$. Modules whose $TF$ value is no greater than $th_3$ are selected as the set of potentially non-critical modules, that is, $C_{topo} = \{c_i \mid TF(c_i) \le th_3\}$, where $th_3$ is a threshold value.
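The three threshold screenings can be expressed compactly once the per-module scores are available. The sketch below assumes the PC, NSL and TF values have already been computed and are supplied as dictionaries keyed by module identifier; it only applies the set definitions given in the text. The function name and all numeric values are hypothetical.

```python
from typing import Dict, Set, Tuple


def screen_modules(pc: Dict[int, float],
                   nsl: Dict[int, float],
                   tf: Dict[int, float],
                   th1: float, th2: float, th3: float
                   ) -> Tuple[Set[int], Set[int], Set[int]]:
    """Apply the three threshold rules to precomputed module scores."""
    c_orth = {c for c, score in pc.items() if score >= th1}   # PC >= th1
    c_sub = {c for c, score in nsl.items() if score >= th2}   # NSL >= th2
    c_topo = {c for c, score in tf.items() if score <= th3}   # TF <= th3
    return c_orth, c_sub, c_topo


# Hypothetical scores for three modules.
pc = {1: 0.02, 2: -0.03, 3: 0.0}
nsl = {1: 2.5, 2: 0.0, 3: 1.0}
tf = {1: 0.8, 2: 0.1, 3: 0.3}
c_orth, c_sub, c_topo = screen_modules(pc, nsl, tf,
                                       th1=-0.005, th2=2, th3=0.25)
```

Note that the first two sets collect candidate critical modules, whereas the third collects candidate non-critical modules.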

Step 4: refining the protein-protein interaction network
Finally, we integrated the above three features of the modules to obtain the final set of selected critical modules, $C_{critical}$. For any interaction $(v_i, v_j) \in E$, if $v_i$ and $v_j$ are both in the critical modules $C_{critical}$, their interaction is retained; otherwise it is removed from $E$. This yields the refined edge set $E_{CM}$ and hence the more refined network, the CM-PIN.
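The final edge-filtering step can be sketched as below. The sketch takes the set of critical modules as an input and does not attempt to reproduce how the three candidate sets are combined; it also treats proteins outside the maximal connected subgraph (which have no module) as non-critical. The names and toy data are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def refine_edges(edges: List[Edge],
                 module_of: Dict[str, int],
                 critical: Set[int]) -> List[Edge]:
    """Keep an interaction only if both proteins lie in critical modules."""
    return [
        (u, v) for u, v in edges
        # .get returns None for proteins without a module, which never matches
        if module_of.get(u) in critical and module_of.get(v) in critical
    ]


# Toy data (hypothetical): modules 1 and 2, only module 1 is critical.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "X")]
module_of = {"A": 1, "B": 1, "C": 2, "D": 2}
cm_edges = refine_edges(toy_edges, module_of, critical={1})
```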

Materials and datasets
We first performed a complete experiment using the Saccharomyces cerevisiae dataset, as this dataset is currently the most complete of all species and has been widely used to test various methods for identifying essential proteins. Then, we used the Homo sapiens dataset to verify the validity of the proposed method.

Protein-protein interaction datasets and essential proteins
The two protein-protein interaction datasets from Saccharomyces cerevisiae used in this paper were downloaded from YDIP [43] and YBioGRID [44], which contain 15,166 and 52,833 interactions, respectively, covering 4746 and 5616 proteins. A dataset of protein-protein interactions from Homo sapiens was downloaded from HDIP [45], which contains 6892 interactions covering 4615 proteins. Essential proteins were collected from the following datasets [46][47][48]: DEG, MIPS, SGD, OGEE. The YDIP, YBioGRID and HDIP datasets contain 1130, 1199 and 726 essential proteins, respectively.

Other biological information
(1) Gene expression profile: The gene expression profiles of the yeast and human datasets were downloaded from GSE3431 [49] and GSE86354 [50], respectively, containing 6,777 and 18,912 proteins. The GSE3431 dataset records observations at 36 time points during three successive metabolic cycles, and the GSE86354 dataset records expression profiles across 8 tissues, including 64 time points.
(2) Subcellular localization information: Subcellular localization information for both species was downloaded from the COMPARTMENTS dataset [51], which contains 11 subcellular compartments for each species.
(3) Orthologous information: Information on orthologous proteins of yeast and human was taken from Version 7 [52] and Version 8 [53] of the InParanoid database, which contain 100 and 162 genome-wide paired comparison sets, respectively.

Node ranking methods
To verify the performance of the CM-PIN, we used 12 typical node ranking methods (DC [6], LAC [7], NC [8], DMNC [9], TP [10], LID [11], CC [12], BC [13], PR [14], LR [15], PeC [16], WDC [17]) and compared their performance in identifying essential proteins on the CM-PIN with that on the S-PIN and on the two existing refined networks (D-PIN [23] and RD-PIN [24]). A node ranking method first calculates the importance scores of all protein nodes in the network according to its formula, then ranks the proteins in descending order of importance score, and finally treats a portion of the highest-ranked proteins as essential proteins.
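The score-rank-select procedure can be illustrated with degree centrality (DC), the simplest of the listed methods, in which the importance score of a protein is its number of interactions. The function names, the alphabetical tie-breaking rule and the toy data are assumptions for illustration.

```python
from collections import Counter
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def degree_ranking(edges: List[Edge]) -> List[str]:
    """Rank proteins by degree, highest first (ties broken by name)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(degree, key=lambda p: (-degree[p], p))


def essential_in_top(ranking: List[str], essential: Set[str], k: int) -> int:
    """Number of known essential proteins among the top-k ranked proteins."""
    return sum(1 for p in ranking[:k] if p in essential)


# Toy network (hypothetical) with two known essential proteins.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
essential = {"A", "C"}
ranking = degree_ranking(toy_edges)
hits_top2 = essential_in_top(ranking, essential, k=2)
```

The same `essential_in_top` count, evaluated at k = 100, 200, ..., 600, is the quantity that would be compared across networks.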

Analysis of the number of essential proteins identification
To show that the proposed network refinement method effectively increases the number of essential proteins identified by each node ranking method, we constructed CM-PINs from the S-PIN, D-PIN and RD-PIN of the YDIP and YBioGRID datasets, respectively. The numbers of essential proteins identified by the node ranking methods at top 100, top 200, top 300, top 400, top 500 and top 600 on the CM-PIN were compared with those on the S-PIN, D-PIN and RD-PIN, as shown in Tables 2 and 3. We denote the CM-PIN refined from the S-PIN (D-PIN or RD-PIN) by CM-PIN(S) (CM-PIN(D) or CM-PIN(RD)), and mark the optimal item in bold when comparing two or more items in all subsequent tables. It can be seen that the CM-PIN significantly improves the identification accuracy of essential proteins by node ranking methods on the yeast datasets, whether the starting network is the static PIN or a refined PIN, and the values at top 100-600 on the CM-PIN are higher than those on the other three existing PINs. The average improvement ratio of the 12 node ranking methods at top 600 on the YDIP and YBioGRID datasets was 9.82% and 20.58% for the CM-PIN refined on the S-PIN, 11.30% and 15.15% for the CM-PIN refined on the D-PIN, and 9.65% and 7.79% for the CM-PIN refined on the RD-PIN. Some node ranking methods improve even more markedly: compared with the S-PIN, the BC method improved by 18.22% at top 600 on the CM-PIN on the YDIP dataset, and compared with the D-PIN, the CC method improved by 56.74% at top 600 on the CM-PIN on the YBioGRID dataset. In addition, the LID method identified 405 essential proteins at top 600 on the CM-PIN refined on the RD-PIN on the YDIP dataset, which is a very high identification accuracy. All of these results illustrate the effectiveness of our method and demonstrate that the CM-PIN is a more refined and effective network.
Table 2 Comparison of the number of essential proteins identified by 12 node ranking methods on the S-PIN, D-PIN, RD-PIN and the CM-PIN at top 100-600 on YDIP dataset
It is worth noting that the focus of this paper is to improve the overall performance of node ranking methods, so we pay more attention to the accuracy of these methods at top 1130 for YDIP (top 1199 for YBioGRID, or top 726 for HDIP). The accuracy at top 100 also increases to some extent in this case. On the other hand, if the goal is to improve performance at top 100, good results can also be achieved by adjusting the parameters of our method appropriately. For example, with the parameters $th_1 = 0.1$, $th_2 = 2$ and $th_3 = −2$, the CM-PIN(RD) for YBioGRID significantly improves the top 100 values of the node ranking methods. However, their top 1199 values then decline to a certain extent. Readers can therefore strengthen a specific performance index by adjusting the parameters according to their own concerns.

Validated by using the Jackknifing method
To evaluate the overall performance of the CM-PIN more comprehensively, we used the Jackknifing method [24,54]. The horizontal axis of the Jackknifing plot indicates the number of top-ranked proteins in the network and the vertical axis represents the number of essential proteins among these top-ranked proteins. Figures 2 and 3 show the number of essential proteins among the top K highest-scoring proteins for each node ranking method on the S-PIN, D-PIN, RD-PIN and CM-PIN (for each node ranking method, the CM-PIN with the best performance among the three CM-PINs is selected). Here K is the number of essential proteins, with K = 1130 on YDIP and K = 1199 on YBioGRID. On both yeast datasets, the Jackknifing curves of all these methods on the CM-PIN lie above those on the other three networks, and the differences are significant, whether the method is a neighborhood-based, path-based or eigenvector-based centrality method, or a node ranking method that integrates multiple kinds of biological information. This further demonstrates that the network refinement method in this paper is effective in removing noise and false positives from protein interaction networks and that the CM-PIN is a more efficient network.
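The Jackknifing curve described above is simply a running count of essential proteins along the ranking. A minimal sketch, assuming the ranking is a list of protein identifiers in descending order of score and the essential proteins are given as a set (names and toy data are hypothetical):

```python
from itertools import accumulate
from typing import List, Set


def jackknife_curve(ranking: List[str], essential: Set[str], K: int) -> List[int]:
    """y[k-1] = number of essential proteins among the top k, for k = 1..K."""
    return list(accumulate(1 if p in essential else 0 for p in ranking[:K]))


ranking = ["A", "B", "C", "D"]
essential = {"A", "C"}
curve = jackknife_curve(ranking, essential, K=4)
```

Plotting `curve` against k = 1..K for each network gives one line per network; a line that lies above another indicates more essential proteins at every cutoff.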

Analysis of precision-recall curves
The identification of essential proteins is a sample imbalance problem: the number of negative samples (non-essential proteins) is much larger than the number of positive samples (essential proteins). When identifying essential proteins, we tend to be more concerned with how many positive samples (essential proteins) can be identified [55]. Therefore, to assess the significance of the CM-PIN, we used precision-recall curves to compare the efficiency of essential protein identification of the 12 node ranking methods (see Figs. 4 and 5). The vertical axis (precision) of the precision-recall curve reflects the proportion of true positives among the examples predicted as positive, and the horizontal axis (recall) reflects the proportion of true positives among all actual positive examples. We further calculated the area under the precision-recall curve (PRAUC), as shown in Table 4. Both the precision-recall curves and the PRAUC values on the CM-PIN were the best on the two yeast datasets. The improvement rate of the PRAUC value of the 12 node ranking methods on the CM-PIN on YDIP and YBioGRID was 3.28%-18.29% and 7.18%-54.62% relative to the S-PIN, 5.85%-17.36% and 6.81%-38.55% relative to the D-PIN, and 4.61%-15.70% and 0.50%-11.63% relative to the RD-PIN. All of these results again confirm the validity of the CM-PIN.
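A precision-recall curve and its area can be computed from a ranked list of 0/1 labels as sketched below. The text does not state how the area was integrated, so the step-wise sum used here (often called average precision) is an assumption; function names and toy labels are illustrative.

```python
from typing import List, Tuple


def pr_curve(ranked_labels: List[int]) -> List[Tuple[float, float]]:
    """(recall, precision) at every cutoff k of a ranked 0/1 label list."""
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    points, tp = [], 0
    for k, label in enumerate(ranked_labels, start=1):
        tp += label
        points.append((tp / total_pos, tp / k))
    return points


def pr_auc(points: List[Tuple[float, float]]) -> float:
    """Step-wise area: sum of precision times the increase in recall."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Labels of the ranked proteins: 1 = essential, 0 = non-essential.
ranked_labels = [1, 0, 1, 0]
points = pr_curve(ranked_labels)
area = pr_auc(points)
```

For the toy ranking, recall rises at cutoffs 1 and 3 with precision 1 and 2/3, giving an area of $0.5 \cdot 1 + 0.5 \cdot 2/3 = 5/6$.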

Validated by accuracy
To further evaluate the overall performance of the CM-PIN and the accuracy of essential protein identification, we used the following seven evaluation metrics: sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (FM), Matthews correlation coefficient (MCC) and accuracy (ACC). Among them, sensitivity is calculated in the same way as recall, and positive predictive value in the same way as precision.
The top K proteins in descending order of importance score were assumed to be essential proteins (K = 1130 and K = 1199 are the numbers of essential proteins for YDIP and YBioGRID), and the metrics are calculated as follows:

$$SN = \frac{TP}{TP + FN}, \quad SP = \frac{TN}{TN + FP}, \quad PPV = \frac{TP}{TP + FP}, \quad NPV = \frac{TN}{TN + FN}$$

$$FM = \frac{2 \cdot SN \cdot PPV}{SN + PPV}, \quad MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \quad ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of essential proteins correctly predicted as essential, FP is the number of non-essential proteins incorrectly predicted as essential, TN is the number of non-essential proteins correctly predicted as non-essential, and FN is the number of essential proteins incorrectly predicted as non-essential.
Tables 5 and 6 show the comparison results of the 12 node ranking methods on the seven indicators for the S-PIN, D-PIN, RD-PIN and CM-PIN(RD). It can be seen that the seven evaluation indicators of the 12 node ranking methods on the CM-PIN are better than those on the other three networks for both yeast datasets, which indicates that the method of refining networks by modules in this paper is feasible and can effectively improve the identification accuracy of essential proteins.
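For reference, the seven metrics can be computed from the four confusion counts using their standard definitions, as in the sketch below. The F-measure is taken here as the harmonic mean of SN and PPV, which is the usual convention but an assumption with respect to this paper; the counts in the example are hypothetical.

```python
import math
from typing import Dict


def evaluation_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard definitions of SN, SP, PPV, NPV, FM, MCC and ACC."""
    sn = tp / (tp + fn)    # sensitivity (recall)
    sp = tn / (tn + fp)    # specificity
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    fm = 2 * sn * ppv / (sn + ppv)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv,
            "FM": fm, "MCC": mcc, "ACC": acc}


# Hypothetical counts: 100 essential among 1000 proteins, top 100 predicted.
metrics = evaluation_metrics(tp=50, fp=50, tn=850, fn=50)
```

When the cutoff K equals the number of essential proteins, FP and FN are equal, so SN, PPV and FM coincide, as they do in this example.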

Selection and analysis of thresholds
In this section, taking the RD-PIN of YDIP as an example, we first describe the concrete steps for constructing the CM-PIN from the RD-PIN and the motivation for using the modular structure of the PIN to refine the network. We then analyze how to select the thresholds. Finally, we list the thresholds used by all the CM-PINs built on the two yeast datasets in this paper.
On the YDIP dataset, the Fast-unfolding algorithm achieved the optimal partitioning at modularity Q = 0.7408, at which point the RD-PIN was partitioned into 26 modules. We calculated three metrics for each module in the RD-PIN, namely PC, NSL and TF (as shown in Table 7), using the biological information of the proteins and the topological information of the modules in the network. We also examined the number and proportion of essential proteins in each module and found variation between modules: some modules with sparse internal interactions or with little biologically important information contained few essential proteins and may be potential non-critical modules. For example, the NSL values of modules 1, 24 and 26 are zero, which means that the proteins in these modules do not appear in the nucleus; after threshold screening, they will likely be defined as non-critical modules. Therefore, to obtain a more effective network, we need to identify the more critical modules in the network and remove the interactions in modules with less biological and topological information.
To examine how the thresholds affect the selection of critical modules and the performance of the network, we chose, according to the distribution of the three metrics over the modules, $th_1 \in \{−0.02, −0.005, 0.015\}$, $th_2 \in \{1.5, 2\}$ and $th_3 \in \{0.25, 0.5\}$, and listed the identification accuracy of essential proteins on the networks obtained with the different threshold values (as shown in Table 8; the experimental results in the table are the performance of LID on the different networks). The experimental results showed that when $th_1$ and $th_2$ were small and $th_3$ was large, more critical modules were selected. In this case, a large amount of noise remained in the network and the improvement in identification accuracy was not significant. For example, when $th_1 = −0.02$, $th_2 = 1.5$ and $th_3 = 0.5$, the identification accuracy at top 600 and the PRAUC improved compared with the RD-PIN, but the identification accuracy at top 1130 was not as good as on the RD-PIN. In contrast, when $th_1$ and $th_2$ were larger, fewer critical modules were selected. In this case, critical parts of the network may have been removed, and the improvement in identification accuracy was not optimal. For example, when $th_1 = 0.015$, $th_2 = 2$ and $th_3 = 0.5$, the identification accuracy of LID at top 1130 on the CM-PIN was still inferior to that on the RD-PIN. Changes in $th_1$ and $th_2$ have a greater impact on the selection of modules, because biological information assists the identification of essential proteins better than the topological information of the network. When $th_1 = −0.005$, $th_2 = 2$ and $th_3 = 0.25$, the optimal CM-PIN on the YDIP dataset is obtained. Finally, Table 9 lists the selected thresholds and module information of the CM-PINs constructed on the two yeast datasets in this paper.
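The threshold analysis amounts to a small grid search over 3 × 2 × 2 = 12 combinations. The sketch below enumerates the grid given in the text; the `evaluate` callback is hypothetical and stands in for whatever score is used to judge a refined network (for example, the number of essential proteins LID finds at top 1130). The dummy scoring function in the example is purely a placeholder and does not reflect any experimental result.

```python
from itertools import product
from typing import Callable, Tuple

# Threshold grids as listed in the text.
TH1 = (-0.02, -0.005, 0.015)
TH2 = (1.5, 2)
TH3 = (0.25, 0.5)


def grid_search(evaluate: Callable[[float, float, float], float]
                ) -> Tuple[float, float, float]:
    """Return the (th1, th2, th3) combination with the highest score."""
    return max(product(TH1, TH2, TH3), key=lambda t: evaluate(*t))


combinations = list(product(TH1, TH2, TH3))

# Placeholder scoring function (hypothetical, for demonstration only).
best = grid_search(lambda th1, th2, th3: th1 - th3)
```

In practice `evaluate` would build the CM-PIN for the given thresholds, run a node ranking method on it and return the chosen performance index.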

Analysis of reasons for the improvement of identification accuracy of essential proteins
To explain why each node ranking method identifies essential proteins more accurately on the CM-PIN than on the other three networks (S-PIN, D-PIN, RD-PIN), we also calculated, for each node ranking method, the proportion of essential proteins among the top-600 proteins that differ between the CM-PIN and each of the other three networks, as shown in Fig. 6. It can be seen that on the CM-PIN, each node ranking method can identify some different essential proteins that cannot be identified on [...] (RD-PIN) is inferior to that on the once-refined PIN (D-PIN) due to fewer raw interactions in the HDIP dataset. That is why individual indexes of the WDC method on the CM-PIN (refined from the RD-PIN) are inferior to those on the RD-PIN. Compared with the S-PIN, D-PIN and RD-PIN, the CM-PINs improve the PRAUC values of the 12 node ranking methods by 14.37%-47.57% relative to the S-PIN, 6.41%-24.90% relative to the D-PIN, and 11.23%-28.11% relative to the RD-PIN. This shows that the network refinement method in this paper is applicable to multiple species and can improve the performance of node ranking methods by producing a more effective network, the CM-PIN.
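The comparison of "different proteins" can be read as a set difference between two top-k lists. The sketch below is one such reading, under the assumption that the "different proteins" of a network are those in its top k but not in the top k of the other network. The function name, the rankings and the essential set are hypothetical.

```python
def essential_ratio_in_difference(ranking_a, ranking_b, essential, k):
    """Share of essential proteins among those in top-k of A but not of B."""
    only_a = set(ranking_a[:k]) - set(ranking_b[:k])
    if not only_a:
        return 0.0  # the two top-k lists coincide
    return len(only_a & essential) / len(only_a)


# Hypothetical rankings produced by one node ranking method on two networks.
ranking_cm = ["P1", "P2", "P3", "P4", "P5"]
ranking_rd = ["P1", "P6", "P2", "P7", "P3"]
essential = {"P1", "P3", "P4", "P6"}

ratio_cm = essential_ratio_in_difference(ranking_cm, ranking_rd, essential, k=4)
ratio_rd = essential_ratio_in_difference(ranking_rd, ranking_cm, essential, k=4)
```

A higher ratio for the CM-PIN than for the compared network would indicate that the proteins newly promoted into the top k by the refinement are more often essential than the ones they displaced.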

Conclusions and perspectives
In this paper, we proposed a protein interaction network refinement method based on module discovery and biological information. First, we extract the maximal connected subgraph of a given PIN and use the Fast-unfolding module discovery algorithm to divide it into different modules. Second, we select critical modules by using the orthologous information and subcellular localization information of the proteins, together with the topological information of each module in the PIN. Third, we construct a more refined network (CM-PIN) from the identified critical modules.
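The first and third steps can be sketched in a few lines. The code below extracts the largest connected component with a breadth-first search and then keeps only interactions whose two endpoints both lie in critical modules. That filtering rule is an assumption made for illustration, since the exact construction rule of the CM-PIN is not restated here; the module assignment is given by hand rather than produced by Fast-unfolding, and all names are hypothetical.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def refine(edges, module_of, critical):
    """ASSUMPTION: keep an interaction only if both proteins are in critical modules."""
    return [(u, v) for u, v in edges
            if module_of[u] in critical and module_of[v] in critical]


# Hypothetical PIN with one small disconnected pair (X, Y).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("X", "Y")]
lcc = largest_connected_component(edges)
lcc_edges = [(u, v) for u, v in edges if u in lcc and v in lcc]

module_of = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}  # hand-made partition
cm_pin = refine(lcc_edges, module_of, critical={0})
```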
To verify the effectiveness of this method, we constructed CM-PINs based on three networks (S-PIN, D-PIN and RD-PIN) of two species (Saccharomyces cerevisiae and Homo sapiens) and compared the performance of 12 node ranking methods (LAC, DC, DMNC, NC, TP, LID, CC, BC, PR, LR, PeC, WDC) on the CM-PIN with that on the three networks. In terms of the number of essential proteins identified at top 100-600, the Jackknife method, the area under the precision-recall curve (PRAUC), sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (FM), Matthews correlation coefficient (MCC) and accuracy (ACC), the node ranking methods perform better on the CM-PIN than on the S-PIN, D-PIN and RD-PIN. On the three datasets of Saccharomyces cerevisiae (YDIP and YBioGRID) and Homo sapiens (HDIP), the highest improvement rates of the PRAUC value of the node ranking methods on the CM-PIN were 18.29%, 54.62% and 47.57% relative to the S-PIN; 17.36%, 38.55% and 24.90% relative to the D-PIN; and 15.70%, 11.63% and 28.11% relative to the RD-PIN, respectively. The results demonstrate that the CM-PIN can effectively filter out false positives and false negatives and is thus a higher-quality network.
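For reference, the seven confusion-matrix indices and the improvement rate can be computed as in the sketch below. It uses the standard textbook definitions, which are assumed (not confirmed here) to match the ones used in the paper. The counts and PRAUC values are hypothetical, and the function names are illustrative.

```python
from math import sqrt


def evaluation_indices(tp, fp, tn, fn):
    """Standard SN, SP, PPV, NPV, FM, MCC and ACC from confusion-matrix counts."""
    sn = tp / (tp + fn)    # sensitivity (recall)
    sp = tn / (tn + fp)    # specificity
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    fm = 2 * sn * ppv / (sn + ppv)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv,
            "FM": fm, "MCC": mcc, "ACC": acc}


def improvement_rate(new, old):
    """Relative improvement of `new` over `old`, e.g. for PRAUC values."""
    return (new - old) / old


# Hypothetical counts: 100 top-ranked proteins predicted essential out of 300.
indices = evaluation_indices(tp=60, fp=40, tn=160, fn=40)
rate = improvement_rate(new=0.55, old=0.50)  # hypothetical PRAUC values
```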
In future work, we will consider contributing further to the identification of essential proteins, the elucidation of disease mechanisms and the design of targeted drugs from the following three perspectives. First, from the perspective of network refinement, the modular characteristics of the network can be combined with other factors to construct a more effective network. For example, other biological information, such as the structure or annotation information of proteins, can be used to further refine unreliable interactions within critical modules. Second, from the perspective of module discovery, different module discovery algorithms, such as clustering algorithms based on biological sequences [56] and attribute graphs [57], can be tried to obtain more accurate partitions of protein-protein interaction networks. Third, the modules discovered in, or the critical modules detected from, the protein-protein interaction network can also be used as features to assist with other biological problems, for example, the classification of Golgi proteins [58], the classification of microorganisms' functional proteins [59], and the design of protein acetylation sites [60].

Fig. 2 Validation of the 12 node ranking methods by the Jackknife methodology on the YDIP dataset

Fig. 5 Comparison of precision-recall curves of the 12 node ranking methods on the YBioGRID dataset

Fig. 6 Comparison of the percentage of essential proteins among the differing top-ranked proteins on the CM-PIN with that on the other three networks for each node ranking method on the YDIP dataset

Table 5
Comparison of seven evaluation indices for 12 node ranking methods on the YDIP dataset

Table 6
Comparison of seven evaluation indices for 12 node ranking methods on the YBioGRID dataset

Table 7
Biological and topological characterization of each module in the RD-PIN on the YDIP dataset

Table 8
Effect of the thresholds on the selection of critical modules and on the performance of the network

Table 9
The selection thresholds and module information of CM-PINs constructed on the YDIP and YBioGRID datasets

Table 10
The selection thresholds and module information of CM-PINs constructed on the HDIP dataset